Diabetes mellitus is a widespread chronic health condition with a growing global prevalence. The disease affects how the body uses blood sugar (glucose) to turn food into energy: with diabetes, the body either does not produce enough insulin or cannot use it properly, causing glucose to accumulate in the bloodstream. Over time, this can lead to serious complications, including heart disease, vision loss, and kidney disease. There are several types of diabetes that differ in certain aspects, yet they share many common symptoms, causes, and effects. An estimated 537 million adults had diabetes in 2021, and according to the WHO, both the number of cases and the prevalence of diabetes have risen steadily over the past few decades. Many risk factors play a role in causing diabetes; however, the exact cause of most types is unknown and can, in most cases, be attributed to a combination of genetic and environmental factors.
In the following, we develop different models that can identify individuals at risk of developing diabetes based on a dataset with the following features: Gender, Age, Hypertension, Heart Disease, Smoking History, BMI, HbA1c (glycated hemoglobin) Level, and Blood Glucose Level.
Based on scientific literature, the following is known about each of the features in relation to the disease:
Gender: The overall global prevalence of diabetes is higher in men, even though there are more women with type 2 diabetes than men. Hence, knowing that the dataset does not include the type of diabetes each individual has, we cannot yet infer the magnitude of the effect this feature will have in terms of predictive power.
Age: While type 1 diabetes usually develops in childhood or early adulthood, people 45 years or older have a higher risk of developing type 2 diabetes. Since the dataset does not distinguish between types, this feature may not have significant predictive power for overall diabetes prediction.
Hypertension: According to Johns Hopkins Medicine, if you have diabetes, you are twice as likely to have high blood pressure. Hence, we can expect this feature to impact our prediction accuracy.
Heart Disease: People with diabetes are 2 to 4 times more likely to develop cardiovascular disease (Johns Hopkins Medicine). Likewise, this feature is expected to impact the results.
Smoking History: People who smoke are 30% to 40% more likely to develop type 2 diabetes (CDC), as nicotine changes cells so they don’t respond to insulin, which increases blood sugar levels. In this particular case, the frequency of smoking, which is not included in the dataset, can impact the likeliness of developing diabetes. So the importance of this feature in our model is yet to be clearly understood.
BMI: An increase in BMI is associated with an increased risk of diabetes. Being overweight (BMI of 25-29.9), or affected by obesity (BMI of 30-39.9) or morbid obesity (BMI of 40 or greater), greatly increases the risk of developing diabetes. Accordingly, we might expect this feature to be discriminative, since the exact BMI is provided.
HbA1c Level: Average blood sugar levels over the past 3 months. A value below 5.7% can be used to classify a person as non-diabetic, while a value of 6.5% or higher can be used to diagnose a person as having diabetes. This is normally very important to take into consideration when trying to determine if a person has diabetes.
Blood Glucose Level: As previously mentioned, diabetes causes blood glucose levels to rise higher than normal (hyperglycemia). A fasting blood sugar level of 99 mg/dL or lower is normal, while 126 mg/dL or higher indicates you are likely to have diabetes (CDC). As hyperglycemia is one of the main direct consequences of diabetes, we can expect this feature to help in accurately predicting the disease.
It is important to identify diabetes as early as possible, as early identification can help prevent or delay the development of the complications previously mentioned. Moreover, early detection can lower healthcare costs by preventing costly complications. Hence, early detection helps reduce the overall burden that this increasingly frequent disease has been imposing on healthcare systems.
The dataset consists of data related to diabetes, with the following characteristics:
Number of Rows: 100,000
Number of Columns: 9
The dataset includes the following columns, which represent the various features:
Gender: Categorical data indicating the gender of the individual (Male, Female).
Age: Numeric data indicating the age of the individual.
Hypertension: Binary data (0 or 1) indicating the presence or absence of hypertension.
Heart Disease: Binary data (0 or 1) indicating the presence or absence of heart disease.
Smoking History: Categorical data indicating the smoking history (ever, never, current smoker, not current smoker, former, or no information).
BMI (Body Mass Index): Numeric data indicating the BMI of the individual.
HbA1c Level: Numeric data indicating the HbA1c level, a measure of blood sugar control over the past 2 to 3 months.
Blood Glucose Level: Numeric data indicating the blood glucose level of the individual.
Diabetes: Binary data (0 or 1) indicating the presence or absence of diabetes.
Our team has selected a subset of the dataset, “diabetes_subset.csv”, which consists of 1,000 records extracted from the original “diabetes.csv” dataset. The original dataset contained 8 features alongside a target label, spanning a total of 100,000 observations. In creating this subset, we maintained the same proportion of class labels as found in the full dataset. Additionally, we deliberately excluded any observations containing the values “ever” or “not current”, since:
ever: Usually means the individual has smoked at some point in their life, but it’s not clear from the context whether they currently smoke or not.
not current: Could imply that the person used to smoke but does not currently, although it’s somewhat ambiguous and could overlap with ‘former’.
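A dplyr sketch of how such a subset could be drawn while preserving the class proportions; the seed and the sampling fraction here are illustrative, not the exact values used to build the file:

```r
library(dplyr)

set.seed(42)  # illustrative seed for reproducibility

diabetes_full <- read.csv("diabetes.csv")

diabetes_subset <- diabetes_full %>%
  # drop the ambiguous smoking categories discussed above
  filter(!smoking_history %in% c("ever", "not current")) %>%
  # sampling the same fraction within each class keeps the label ratio
  group_by(diabetes) %>%
  slice_sample(prop = 0.011) %>%   # ~1,000 rows out of ~100,000
  ungroup()

write.csv(diabetes_subset, "diabetes_subset.csv", row.names = FALSE)
```

Grouping by the label before sampling is what guarantees the subset mirrors the full dataset's class distribution.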
After having selected an appropriate dataset for our topic of interest, we now need to understand its characteristics and draw insight from them. We will start by identifying any missing values and duplicates, as well as getting an overview of the dataset's size through the number of rows and variables present. After establishing this fundamental understanding, we will visualize the data using various graph types, such as histograms, bar charts, and heatmaps. This will let us display different aspects of the dataset and facilitate a deeper understanding through comparative and pattern analysis. This initial step is crucial in setting the stage for more advanced data analysis and insight extraction.
Note: Please be aware that the heatmap will be applied following the conversion of categorical data into numeric values during the data preprocessing stage, as heatmaps work exclusively with numeric data.
What is the dimension of our dataset?

## [1] 1000 9

What are the variables in our dataset?

## [1] "gender" "age" "hypertension"
## [4] "heart_disease" "smoking_history" "bmi"
## [7] "HbA1c_level" "blood_glucose_level" "diabetes"
# Load necessary libraries
library(tidyverse)    # includes ggplot2, dplyr, tidyr, readr
library(gt)
library(caret)
library(MASS)         # for LDA and QDA
library(pROC)         # for ROC curves
library(broom)
library(knitr)
library(reshape2)
library(corrplot)
library(boot)
library(class)
library(ggthemes)     # for additional themes
library(ggrepel)
library(gridExtra)
library(pheatmap)
library(randomForest)
library(kernlab)

missing_values_per_column <- sapply(data, function(x) sum(is.na(x)))
print(missing_values_per_column)
## gender age hypertension heart_disease
## 0 0 0 0
## smoking_history bmi HbA1c_level blood_glucose_level
## 0 0 0 0
## diabetes
## 0
duplicate_rows <- duplicated(data)
number_of_duplicates <- sum(duplicate_rows)
print(number_of_duplicates)
## [1] 1
What does our data look like?

| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|
| Female | 9 | 0 | 0 | No Info | 27.32 | 5.8 | 158 | 0 |
| Female | 53 | 0 | 0 | never | 27.32 | 4.5 | 80 | 0 |
| Female | 57 | 1 | 0 | current | 27.32 | 8.2 | 126 | 1 |
| Male | 10 | 0 | 0 | No Info | 14.39 | 4.8 | 126 | 0 |
| Female | 64 | 0 | 0 | No Info | 27.67 | 5.7 | 145 | 1 |
In our data analysis, we’ll use histograms to visualize numerical data, effectively revealing distribution patterns like skewness and outliers through frequency bins. For categorical data, bar graphs will be our choice, offering a clear comparison of category frequencies, which is invaluable for identifying prevalent groups and observing trends across different categories. These visualization techniques are essential for gaining a deeper understanding of both numerical and categorical aspects of our dataset.
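For instance, a histogram for a numeric feature and a bar chart for a categorical one might be built as follows; the binwidth and colors are illustrative choices, and `data` is the loaded dataset:

```r
library(ggplot2)

# Histogram: distribution of a numeric feature
ggplot(data, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "white") +
  labs(title = "BMI Distribution", x = "BMI", y = "Count") +
  theme_minimal()

# Bar chart: frequencies of a categorical feature
ggplot(data, aes(x = smoking_history)) +
  geom_bar(fill = "pink") +
  labs(title = "Smoking History Distribution", x = "Category", y = "Count") +
  theme_minimal()
```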
## gender age hypertension heart_disease
## Length:1000 Min. : 0.24 Min. :0.000 Min. :0.000
## Class :character 1st Qu.:23.00 1st Qu.:0.000 1st Qu.:0.000
## Mode :character Median :42.00 Median :0.000 Median :0.000
## Mean :41.54 Mean :0.079 Mean :0.044
## 3rd Qu.:60.00 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :80.00 Max. :1.000 Max. :1.000
## smoking_history bmi HbA1c_level blood_glucose_level
## Length:1000 Min. :10.40 Min. :3.500 Min. : 80.0
## Class :character 1st Qu.:23.60 1st Qu.:4.800 1st Qu.:100.0
## Mode :character Median :27.32 Median :5.800 Median :140.0
## Mean :27.01 Mean :5.497 Mean :139.5
## 3rd Qu.:29.21 3rd Qu.:6.200 3rd Qu.:159.0
## Max. :60.20 Max. :9.000 Max. :300.0
## diabetes
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.085
## 3rd Qu.:0.000
## Max. :1.000
Age Distribution: Exhibits a bimodal pattern with peaks around young adults and middle-aged individuals, indicating a concentration of data in these age groups.
BMI Distribution: Shows a right-skewed distribution with a concentration in the ‘overweight’ category, suggesting a prevalence of higher BMI values in the dataset.
HbA1c Level Distribution: Displays a roughly normal but slightly right-skewed distribution, with most values falling within the normal to well-controlled diabetic range.
Blood Glucose Level Distribution: Is right-skewed with notable spikes around the normal fasting level and the diabetic threshold, indicating distinct groupings possibly related to diabetes classification.
Hypertension Distribution: A majority of individuals do not have hypertension (indicated by 0), with a small proportion of the dataset having hypertension (indicated by 1).
Heart Disease Distribution: Similarly, most individuals in the dataset do not have heart disease, while a minority does.
Note that even though we have mentioned that diabetes is likely to cause the previous two conditions, it does not mean that every person with diabetes will develop them, nor does it mean that every individual exhibiting these conditions will necessarily be diabetic.
These distributions suggest that the dataset primarily consists of individuals without these conditions, which could be relevant when analyzing the overall health profile or risk factors associated with the study’s main condition of interest, diabetes.
ggplot(diabetes_data, aes(x = factor(diabetes), fill = factor(diabetes))) +
geom_bar() +
scale_fill_manual(values=c("0" = "lightblue", "1" = "pink")) +
labs(title = "Class Distribution of Diabetes Output",
x = "Diabetes (0 = No, 1 = Yes)",
y = "Count") +
theme_minimal()

Implications for Model Performance Metrics: These results help us make the following inferences:
Given that only about 8.5% of individuals in the dataset are diabetic, accuracy alone can be misleading; alternative metrics such as the F1 score and the AUC should be considered for a more suitable assessment of a model’s performance, especially in imbalanced datasets like the one observed for diabetes classification.
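As a concrete illustration of why this matters, here is a base-R sketch computing precision, recall, F1, and accuracy from confusion-matrix counts; the counts are invented for illustration:

```r
# Hypothetical confusion-matrix counts (illustrative only)
TP <- 14; FP <- 4; FN <- 3; TN <- 169

precision <- TP / (TP + FP)   # share of positive predictions that are correct
recall    <- TP / (TP + FN)   # share of true positives that are found
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (TP + TN) / (TP + FP + FN + TN)

round(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy), 3)
```

Note that on a dataset this imbalanced, a model predicting "no diabetes" for everyone would still score over 90% accuracy while achieving a recall of zero, which is exactly why F1 and AUC are the more informative metrics here.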
Before we proceed to building our models, we need to clean and preprocess the dataset to make it more suitable for machine learning algorithms. Data preprocessing serves as the foundation upon which our models are built: through careful handling of the data, we aim to enhance its quality, consistency, and compatibility, ultimately facilitating the creation of robust and accurate predictive models.
Removing observations with smoking_history = “No Info” from the dataset is primarily motivated by the need to ensure data quality and relevance, particularly in studies where smoking status is a crucial factor. This step helps maintain the integrity of statistical analyses by focusing on complete and consistent information, which is essential for accurate modeling and analysis. It avoids the potential biases and reduced model performance that can arise from including data points with missing or undefined key information.

| gender | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | smoking_historycurrent | smoking_historyever | smoking_historyformer | smoking_historynever | smoking_historynot.current | diabetes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53 | 0 | 0 | 27.32 | 4.5 | 80 | 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 57 | 1 | 0 | 27.32 | 8.2 | 126 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 21 | 0 | 0 | 27.32 | 3.5 | 160 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 19 | 0 | 0 | 25.94 | 6.0 | 126 | 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 35 | 0 | 0 | 27.32 | 6.1 | 126 | 1 | 0 | 0 | 0 | 0 | 0 |
Converting gender from categorical values (‘Male’, ‘Female’) to binary numeric values (1 for Male, 0 for Female) is done to facilitate the use of this variable in statistical and machine learning models (note that decision trees can handle categorical data directly). Many models require numerical input and cannot process categorical data in its original form. Encoding gender as 0 and 1 simplifies the dataset and makes it compatible with a wider range of algorithms that expect numerical input.
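A minimal sketch of this recoding with base R's `ifelse`, on a toy data frame standing in for the real dataset:

```r
# Toy stand-in for the real dataset
df <- data.frame(gender = c("Male", "Female", "Female", "Male"))

# 1 = Male, 0 = Female, per the convention described above
df$gender <- ifelse(df$gender == "Male", 1, 0)

print(df$gender)  # 1 0 0 1
```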
One-hot encoding “smoking history” is crucial for converting categorical data into a numerical format suitable for machine learning algorithms. It avoids implying false ordinal relationships between categories and can enhance model performance. This process introduces higher dimensionality, but it’s a worthwhile trade-off for more accurate and effective data representation in many analytical models.
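One way to produce such indicator columns is base R's `model.matrix`; dropping the intercept gives one 0/1 column per category, which matches the `smoking_history*` column naming seen in the encoded table above. A sketch on toy data:

```r
smoking <- data.frame(smoking_history = c("never", "current", "former", "never"))

# "~ smoking_history - 1" removes the intercept so every level
# gets its own 0/1 indicator column
onehot <- model.matrix(~ smoking_history - 1, data = smoking)

colnames(onehot)
# "smoking_historycurrent" "smoking_historyformer" "smoking_historynever"
```

Each row has exactly one 1 across the indicator columns, so no false ordering between categories is implied.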
* Diabetes has a positive correlation with smoking_historyformer, suggesting that individuals with a history of smoking may have a higher likelihood of diabetes, which is consistent with the literature.
* There is a moderate positive correlation between diabetes and heart_disease, indicating a possible association where individuals with heart disease may have an increased possibility of being diabetic.
* The correlation between diabetes and blood_glucose_level is positive and strong, which aligns with medical knowledge since, as previously mentioned, elevated blood glucose levels are a hallmark of diabetes.
* Likewise, a similar positive correlation is noted between diabetes and HbA1c_level, another clinical marker used to diagnose and monitor diabetes.
* The correlations of diabetes with bmi, hypertension, and age are present but relatively weaker, suggesting these factors may have a less direct or more complex relationship with diabetes. This makes sense for age, as the correlation between age and diabetes depends on the type of diabetes, as previously discussed. Moreover, hypertension is a possible repercussion of diabetes rather than a cause that can be used individually for prediction, so this makes sense as well, even though it does not align with our previous expectations. Similarly, we expected BMI to have better predictive power than it actually does.
* Diabetes does not show a strong correlation with gender or with categories of smoking_history other than former, indicating these factors may not be as significant in predicting diabetes in this dataset. For gender, this likely goes back to the limitation of not having the diabetes type provided, which, as previously mentioned, is where prevalence starts varying between genders.
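The correlations above can be computed with `cor()` and visualized with the corrplot package loaded earlier; `df_encoded` refers to the fully numeric, encoded dataset used later in the report:

```r
# Pairwise Pearson correlations on the encoded, numeric dataset
corr_matrix <- cor(df_encoded)

# Heatmap of the correlation matrix (upper triangle only)
corrplot(corr_matrix, method = "color", type = "upper",
         tl.cex = 0.7, tl.col = "black")
```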
Now, we need to split the dataset into training and test sets. In order to do that, we will begin by establishing a random seed to ensure reproducibility. Subsequently, we will employ a method to perform the actual data split.
Our chosen configuration involves dividing the dataset into an 80% Training Set and a 20% Test Set.
Selecting this 80-20 split for a dataset consisting of 950 observations is a judicious decision. It provides us with 760 observations for robust training, which is crucial when dealing with complex datasets. Simultaneously, it reserves 190 observations for testing, enabling us to effectively evaluate the model’s performance. This partitioning strategy effectively circumvents the shortcomings of both a 50-50 split (which lacks sufficient training data) and a 70-30 split (which results in an excessively large test set). This balanced approach strikes a harmonious balance between reducing bias in training and assessing model variance during testing.
To accomplish this split, we will utilize the createDataPartition function from the caret package. This choice is preferred over the sample function because while sample() provides a basic means of random sampling for data splitting, createDataPartition() offers a more specialized and balanced approach. This specialization proves particularly advantageous when dealing with classification problems, as it helps maintain appropriate class proportions in the split dataset.
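A sketch of the split described above; the variable names and seed value are illustrative:

```r
library(caret)

set.seed(123)  # arbitrary seed for reproducibility

# createDataPartition stratifies on the outcome, preserving the
# ~8.5% positive rate in both the training and test sets
train_idx <- createDataPartition(factor(data$diabetes), p = 0.8, list = FALSE)

train_set <- data[train_idx, ]
test_set  <- data[-train_idx, ]
```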
We plan to employ the Random Forest algorithm for selecting the most important features from our original dataset, which notably includes categorical features. The choice of Random Forest is strategic, as it’s renowned for its robustness and efficiency in handling various types of data, including categorical variables. This method operates by constructing numerous decision trees during the training process and outputting the mode of the classes (classification) or mean prediction (regression) of the individual trees. This ensemble approach not only enhances the predictive accuracy but also helps in avoiding overfitting, a common pitfall in complex models. Additionally, Random Forest has an inherent feature importance measurement, providing valuable insights into which features contribute most significantly to the predictive model. This makes it an ideal tool for our feature selection process, ensuring that we focus on the most relevant variables, thereby improving the efficiency and effectiveness of our subsequent modeling efforts.
## Feature Importance
## 8 blood_glucose_level 32.6679179
## 7 HbA1c_level 24.8429745
## 6 bmi 3.1911015
## 3 hypertension 2.3128509
## 4 heart_disease 2.1598332
## 1 gender 0.4785635
## 5 smoking_history -0.1109029
## 2 age -0.3563031
We used the ‘data’ dataset consisting of the 950 observations and the original 9 variables to fit the random forest model, and test the importance of our features.
We systematically varied the number of trees, spanning the range from 50 to 500, incrementing by 50 with each iteration. For each setting, we meticulously conducted a 5-fold cross-validation procedure, ensuring the robustness and generalizability of our results. During this process, we calculated the average feature importance across the 5 folds for each specific number of trees. This iterative approach allowed us to comprehensively assess how the importance of each feature evolved in response to the changing complexity of the model.
To consolidate our findings and gain a holistic understanding of feature importance, we took the extra step of averaging the importance scores across the diverse set of tree configurations we experimented with. This methodical analysis not only helped us identify the significance of individual features but also provided us with valuable insights into how the Random Forest model responded to different levels of tree diversity, ultimately enabling us to make informed decisions about feature selection and model refinement.
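The procedure described above might be sketched as follows, using randomForest's permutation importance (MeanDecreaseAccuracy); the seed and exact loop structure are illustrative:

```r
library(caret)
library(randomForest)

set.seed(123)
ntree_grid <- seq(50, 500, by = 50)
folds <- createFolds(data$diabetes, k = 5)

# For each ntree setting, average MeanDecreaseAccuracy over the 5 folds
importance_runs <- lapply(ntree_grid, function(nt) {
  fold_imp <- sapply(folds, function(test_idx) {
    rf <- randomForest(factor(diabetes) ~ ., data = data[-test_idx, ],
                       ntree = nt, importance = TRUE)
    importance(rf, type = 1)[, 1]   # type = 1: MeanDecreaseAccuracy
  })
  rowMeans(fold_imp)
})

# Average again across the ten ntree settings
overall_importance <- Reduce(`+`, importance_runs) / length(importance_runs)
sort(overall_importance, decreasing = TRUE)
```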
In our evaluation we employed the 'MeanDecreaseAccuracy' metric to assess how model accuracy changes when we permute a feature’s values. This provides a valuable estimate of each feature’s importance: a higher score indicates a more substantial decrease in accuracy when that feature’s values are shuffled, signifying the feature’s significance to the model.
HbA1c_level: This feature has the highest importance score of approximately 36.48, highlighting its critical role in predicting diabetes within the dataset. It’s likely the most significant indicator among all the features considered. This makes sense as it is used to clinically classify a person as diabetic or non-diabetic based on certain thresholds.
blood_glucose_level: Coming in with a score of approximately 28.40, this feature is also highly influential in the model, which aligns with medical knowledge that blood glucose levels are a primary diagnostic criterion for diabetes.
age: Age has an importance score of about 3.19, suggesting it has a moderate influence on the model’s predictions. This does not align with the previously made assumptions, which again highlights the importance of this process to gain insight that might not have been obvious.
bmi: With a score around 2.37, the body mass index is another moderate predictor. Higher BMI is often associated with increased risk of diabetes.
smoking_history: This has a lower importance score of approximately 1.27, indicating a smaller but still significant influence on diabetes risk prediction.
hypertension: The score for hypertension is about 0.70, which suggests it has a minor influence on the prediction model.
gender: Gender shows a small positive importance score of approximately 0.11, indicating that while gender may play a role in the model, its impact is limited. This aligns with the previously addressed issue of not having the diabetes type, which is more correlated with gender than just diabetes in general.
heart_disease: Interestingly, heart disease has a negative importance score of about -0.37. This does not necessarily mean that heart disease negatively impacts the model’s accuracy. Instead, it may indicate that heart disease is less predictive of diabetes compared to other features, at least within the context of this Random Forest model.
Consequently, our analysis revealed that the gender, heart disease, and hypertension features consistently ranked as the least important in our models. In other words, these contribute the least to solving the problem. As a result, we are inclined to remove these features to conduct further testing and gauge the predictive power of our models without their influence.
To start, we’ll run the random forest model with all the features included and assess its accuracy and F1 score. Afterward, we’ll rerun the model, this time excluding the gender, heart disease, and hypertension variables. This comparative analysis aims to confirm that these variables indeed lack significant predictive power, as indicated in the feature selection section.
## [1] "F1 Score: 0.982905982905983"
## [1] "Accuracy: 0.968503937007874"
After removing the gender, heart disease, and hypertension features:

## [1] "F1 Score: 0.978723404255319"
## [1] "Accuracy: 0.960629921259842"
As demonstrated above, both the accuracy and F1 score remained essentially unchanged (each dropping by less than one percentage point) after the removal of these three features. This observation strongly suggests that these three features hold little to no significance in predicting the diabetes label.
However, it’s essential to note that the removal of any other feature, particularly the most critical ones like HbA1c_level and blood_glucose_level, resulted in a significant decrease in these evaluation metrics. This highlights the pivotal role played by these two features in our model’s predictive power.
We will train our model on a refined dataset, one that excludes the features that we’ve identified as less significant for our Bagging model. Bagging operates by creating multiple subsets of the dataset, each with its own variations, and trains individual models on these subsets. These models, often decision trees, work in parallel to provide predictions. In the end, Bagging combines the predictions from these models to generate a more robust and accurate overall prediction. This technique is particularly effective in reducing overfitting and improving generalization, making it a valuable tool in our model-building arsenal.
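Since bagging is equivalent to a random forest that considers every predictor at each split, one way to fit it is `randomForest` with `mtry` set to the number of features; the `train_set`/`test_set` names are illustrative:

```r
library(randomForest)

p <- ncol(train_set) - 1   # number of predictors, excluding the label

# mtry = p makes every tree consider all features at each split,
# which reduces a random forest to plain bagging
bag_model <- randomForest(factor(diabetes) ~ ., data = train_set,
                          mtry = p, ntree = 500)

bag_pred <- predict(bag_model, newdata = test_set)
```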
## [1] "Accuracy: 0.968503937007874"
## [1] "F1 Score: 0.8"
## [1] "Cross-Validated Accuracy: 0.94488188976378"
## [1] "Cross-Validated F1 Score: 0.96969696969697"
## CART
##
## 510 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 459, 459, 459, 459, 459, 459, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 1e-04 0.9509804 0.7370717
## 2e-04 0.9509804 0.7370717
## 3e-04 0.9509804 0.7370717
## 4e-04 0.9509804 0.7370717
## 5e-04 0.9509804 0.7370717
## 6e-04 0.9509804 0.7370717
## 7e-04 0.9509804 0.7370717
## 8e-04 0.9509804 0.7370717
## 9e-04 0.9509804 0.7370717
## 1e-03 0.9509804 0.7370717
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
## [1] "Cross-Validated Accuracy: 0.94488188976378"
## [1] "Cross-Validated F1 Score: 0.96969696969697"
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
## [75] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## [112] 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0
* Pruned Vs Unpruned:
F1 Score: The pruned tree achieves an F1 score of 98.30%, which is extremely high. The unpruned tree has an F1 score of 96.23%; while this is also a strong score, it is slightly lower than the pruned tree’s.
Pruning a decision tree (by adjusting the complexity parameter cp) helps in reducing the size of the tree, making it simpler and more generalizable. A pruned tree typically has fewer nodes and depth, reducing the risk of overfitting to the training data.
An unpruned tree, on the other hand, can grow very large and complex. It can perfectly fit the training data, capturing all its nuances and noise, which often leads to overfitting. This means it might not perform as well on new, unseen data.
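With rpart, pruning typically means growing a large tree and cutting it back at the cp value that minimizes cross-validated error; a sketch, assuming the same `train_set` naming as above:

```r
library(rpart)

# Grow a deliberately large tree (cp = 0 disables complexity-based stopping)
full_tree <- rpart(factor(diabetes) ~ ., data = train_set,
                   control = rpart.control(cp = 0))

# Pick the cp minimizing the cross-validated error (xerror) in the cp table
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)
```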
Random Forest: This curve is closest to the top-left corner, which suggests that the Random Forest model has the highest True Positive Rate (TPR) for most thresholds and the lowest False Positive Rate (FPR). It demonstrates the best discriminatory ability among the three models.
Bagging: The Bagging model’s ROC curve is also quite close to the top-left corner but not as close as the Random Forest model’s curve. This indicates that while the Bagging model is performing well, it is not as effective at classifying the positive class as the Random Forest model.
Pruned Decision Tree: The curve for the Pruned Decision Tree is the furthest from the top-left corner compared to the other two. This suggests that it has the lowest TPR and the highest FPR among the three models, which indicates that it has the worst performance in terms of ROC.
When comparing models using ROC curves, the Area Under the Curve (AUC) can also provide a single measure of performance regardless of the classification threshold. The model with the highest AUC is generally considered the best. The curve that is highest and most to the left is the best model.
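With pROC (loaded earlier), the three curves and their AUCs can be compared along these lines; `rf_probs`, `bag_probs`, and `tree_probs` are hypothetical names for each model's predicted class-1 probabilities on the test set:

```r
library(pROC)

roc_rf   <- roc(test_set$diabetes, rf_probs)
roc_bag  <- roc(test_set$diabetes, bag_probs)
roc_tree <- roc(test_set$diabetes, tree_probs)

# Overlay the three curves on one plot
plot(roc_rf,   col = "darkgreen", legacy.axes = TRUE)
plot(roc_bag,  col = "steelblue", add = TRUE)
plot(roc_tree, col = "firebrick", add = TRUE)

# One AUC per model; higher is better
sapply(list(rf = roc_rf, bagging = roc_bag, pruned_tree = roc_tree), auc)
```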
In our analysis, we observed that the F1 score for the Random Forest model was exceptionally high at 98.30%. Comparatively, the Bagging model scored 76.92%, and the Pruned Decision Tree model also performed well with a score of 96.23%. It is relatively uncommon for a Decision Tree, even a pruned one, to achieve an F1 score close to that of a Random Forest. This unusual result prompted a thorough investigation into the possible causes.
We undertook various preprocessing techniques on our dataset to understand this anomaly. These techniques included working with the original dataset containing categorical data, encoding the categorical data and using the dataset comprising solely numeric data, and trying to fit the dataset after standardizing and normalizing the data. Despite these varied approaches, the outcomes remained consistently similar across different model runs. This consistency suggests that there might be underlying factors at play, potentially related to the nature of the dataset or the inherent characteristics of the models used.
Moreover, a deeper insight was gained when we analyzed the ROC (Receiver Operating Characteristic) curves of these models. The ROC curves clearly delineated the performance differences among the models, aligning more with our initial expectations. The ROC analysis provided a more nuanced view of the models’ performance, highlighting aspects that the F1 score alone might have obscured. This discrepancy between the F1 scores and the ROC curve analysis underscores the importance of utilizing multiple evaluation metrics to gain a comprehensive understanding of model performance.
After exploring Decision Trees and Random Forests, we now advance to Support Vector Machines (SVM) for their superior handling of complex classification boundaries. SVM stands out with its use of kernel functions, allowing it to efficiently process linear and non-linear data.
Transitioning to SVM: Decision Trees and Random Forests, while effective, can fall short in complex scenarios. SVM, with its ability to maximize the margin between classes, offers a more robust solution, especially when dealing with non-linear boundaries.

Exploring SVM Kernels: We will experiment with three SVM kernels.

Comparative Analysis: Finally, we’ll compare the performance of each kernel to determine the most effective approach for our dataset, leveraging the strengths of SVM in contrast to Decision Trees and Random Forests.
Let’s start with the linear kernel. The linear kernel is used when the data is linearly separable.
## [1] "F1 Score: 0.96234309623431"
## [1] "Accuracy: 0.929133858267717"
Above we used C = 0.01; now we will explore more values of C and how they affect performance. Steps:

1. Defining the Hyperparameter Range: A range of values for the hyperparameter C (the regularization parameter in SVM) is defined. This range spans from \(10^{-3}\) to \(10^{2}\), increasing in steps of \(0.5\) on a logarithmic scale.
2. Setting Up Cross-Validation: The trainControl function is used to set up a 10-fold cross-validation process. The twoClassSummary function is specified as the summaryFunction to compute performance metrics suitable for binary classification, including the ROC metric.
3. Training the SVM Model: The train function from the caret package is used to train the SVM model. This function internally performs a grid search across the defined range of C values. The model’s performance is evaluated using the Area Under the ROC Curve (AUC) metric (metric = "ROC").
4. Selecting the Best Model: After training, the model with the best performance (highest ROC value) is selected. The corresponding C value is identified as the best hyperparameter setting.
library(caret)
library(e1071)
df_encoded$diabetes <- factor(df_encoded$diabetes, levels = unique(df_encoded$diabetes), labels = c("0", "1"))
train_y_encoded <- factor(train_y_encoded, levels = c("0", "1"), labels = c("Class0", "Class1"))
test_y_encoded <- factor(test_y_encoded, levels = c("0", "1"), labels = c("Class0", "Class1"))
# Define a range of C values to try
c_values <- 10^seq(-3, 2, by = 0.5)
# Set up train control for 10-fold cross-validation (ROC is the tuning metric)
train_control <- trainControl(method = "cv", number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions = "final")
# Train the SVM model using caret's train function with ROC (AUC) as the metric
set.seed(123)
tuned_svm <- train(x = train_x_encoded, y = train_y_encoded, method = "svmLinear",
trControl = train_control,
tuneGrid = expand.grid(C = c_values),
metric = "ROC")
# Print the best C value based on the ROC metric
best_c_value <- tuned_svm$bestTune$C
print(paste("Best C Value:", best_c_value))
## [1] "Best C Value: 0.1"
# Predict on the test data using the best model
svm_predictions <- predict(tuned_svm, test_x_encoded)
# Compute the confusion matrix
conf_matrix <- confusionMatrix(svm_predictions, test_y_encoded)
# Extract F1 score and accuracy from the confusion matrix
f1_score <- conf_matrix$byClass["F1"]
accuracy <- conf_matrix$overall["Accuracy"]
# Print the results
print(paste("F1 Score:", f1_score))
## [1] "F1 Score: 0.970212765957447"
## [1] "Accuracy: 0.94488188976378"
The output of our SVM model training and evaluation indicates a high level of performance in classifying the given label, with an improvement after tuning the cost value. The optimal regularization hyperparameter C is 0.1, suggesting a balance between model complexity and generalization. The F1 score, a balanced measure of precision and recall, is high at approximately 0.970, with an accuracy of about 94.5%, indicating that the model is reliable in both identifying and classifying positive cases.
## [1] "AUC for SVM with Linear Kernel: 0.968840579710145"
We will now transition to using the Radial Basis Function (RBF) kernel, a powerful kernel type in SVM modeling. The Radial kernel, also known as the Gaussian kernel, transforms the feature space into a higher dimension where linear separation of the classes is more feasible, making it highly effective for non-linear datasets.
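For intuition, the transformation the RBF kernel relies on can be made concrete with a small sketch: the kernel value between two points depends only on their squared Euclidean distance, scaled by gamma. The function name and toy inputs below are our own illustration, not part of the analysis code.

```r
# Sketch of the RBF (Gaussian) kernel: k(x, y) = exp(-gamma * ||x - y||^2).
# Larger gamma means a narrower kernel, so similarity decays faster with distance.
rbf_kernel <- function(x, y, gamma = 1) {
  exp(-gamma * sum((x - y)^2))
}

rbf_kernel(c(1, 2), c(1, 2))                  # identical points: similarity 1
rbf_kernel(c(1, 2), c(5, 9), gamma = 1)       # distant points: similarity near 0
rbf_kernel(c(1, 2), c(5, 9), gamma = 0.001)   # tiny gamma: still fairly similar
```

This is why gamma acts as a "kernel width" parameter: it controls how locally the model responds to each training point.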
set.seed(123)
# Train the SVM model with a radial (RBF) kernel
svm_model <- svm(train_x_encoded, train_y_encoded, kernel = "radial", gamma= 0.001)
# Predict on the test data
svm_predictions <- predict(svm_model, test_x_encoded)
# Compute the confusion matrix
conf_matrix <- confusionMatrix(svm_predictions, test_y_encoded)
# Extract F1 score and accuracy from the confusion matrix
f1_score <- conf_matrix$byClass["F1"]
accuracy <- conf_matrix$overall["Accuracy"]
# Print the results
print(paste("F1 Score:", f1_score))
## [1] "F1 Score: 0.950413223140496"
## [1] "Accuracy: 0.905511811023622"
Above we used a gamma value of 0.001; now we will explore more values of gamma and see how they affect performance. Steps:
1. Defining the hyperparameter range: a range of gamma values for the SVM with a Radial Basis Function (RBF) kernel is defined, spanning from \(10^{-3}\) to \(10^{3}\) in steps of \(0.5\) on a logarithmic scale, to explore a wide spectrum of kernel widths.
2. Setting up cross-validation: cross-validation is established using the tune.control function, specifying a 10-fold strategy (cross = 10). This ensures a robust assessment of the model's performance across different subsets of the data.
3. Performing the grid search: the tune function from the e1071 package is employed to conduct a grid search over the defined range of gamma values, systematically testing each value and evaluating its effect on the model's performance.
4. Identifying the optimal hyperparameter: after the grid search, the best-performing model is identified from the cross-validation results, and the optimal gamma value is extracted (best_gamma) as the most effective setting for the SVM with the RBF kernel.
5. Training and evaluating the final model: the SVM model is retrained using the optimal gamma value and assessed on the test set, with key performance metrics such as the F1 score and accuracy derived from the confusion matrix, providing insights into the model's classification capability.
set.seed(123)
# Define a range of gamma values to test
gamma_range <- 10^seq(-3, 3, by = 0.5)
# Set up the tune control to perform cross-validation
tune_control <- tune.control(sampling = "cross", cross = 10)
# Perform the cross-validation using the tune() function
svm_tune_result <- tune(
svm,
train.x = train_x_encoded,
train.y = train_y_encoded,
kernel = "radial",
ranges = list(gamma = gamma_range),
tunecontrol = tune_control
)
# Print the best model found
#print(svm_tune_result$best.model)
# Extract the best gamma value
best_gamma <- svm_tune_result$best.parameters$gamma
# Print the best gamma value
print(paste("Best gamma value:", best_gamma))
## [1] "Best gamma value: 0.0316227766016838"
# Train the SVM model using the best gamma value
best_svm_model <- svm(train_x_encoded, train_y_encoded, kernel = "radial", gamma = best_gamma)
# Predict on the test data using the best model
best_svm_predictions <- predict(best_svm_model, test_x_encoded)
# Compute the confusion matrix for the best model
best_conf_matrix <- confusionMatrix(best_svm_predictions, test_y_encoded)
# Extract F1 score and accuracy from the confusion matrix
best_f1_score <- best_conf_matrix$byClass["F1"]
best_accuracy <- best_conf_matrix$overall["Accuracy"]
# Print the results for the best model
print(paste("Best F1 Score:", best_f1_score))
## [1] "Best F1 Score: 0.974576271186441"
## [1] "Best Accuracy: 0.952755905511811"
The results from tuning the SVM model with the Radial Basis Function (RBF) kernel demonstrate a highly effective model for our dataset. The optimal gamma value found is approximately 0.0316, suggesting that this setting for the kernel width achieves a good balance between capturing the complexity of the data and avoiding overfitting. The F1 score, a critical metric that balances precision and recall, is remarkably high at around 0.975, indicating that the model classifies instances correctly while maintaining a balance between precision and recall. Additionally, the overall accuracy of the model stands at approximately 95.3%, reinforcing its robustness in making correct predictions across both classes.
## [1] "AUC for SVM with Radial Kernel: 0.958695652173913"
We are now transitioning to explore the use of the polynomial (poly) kernel in our Support Vector Machine (SVM) modeling. The polynomial kernel, often simply referred to as the “poly kernel,” is a type of kernel function that represents the similarity of vectors (samples) in a feature space over polynomials of the original variables, allowing the model to capture more complex relationships. Unlike linear or radial kernels, the polynomial kernel considers not only the given features but also their interactions up to a certain degree, defined by the kernel’s degree parameter.
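As a small aside, the polynomial kernel used by e1071 has the documented form \(k(x, y) = (\gamma \langle x, y\rangle + c_0)^d\). A minimal sketch (our own toy function, with gamma fixed to 1 and coef0 to 0 for simplicity):

```r
# Sketch of the polynomial kernel: (gamma * <x, y> + coef0)^degree.
# With degree = 2 it implicitly includes all pairwise feature interactions.
poly_kernel <- function(x, y, degree = 2, gamma = 1, coef0 = 0) {
  (gamma * sum(x * y) + coef0)^degree
}

poly_kernel(c(1, 2), c(3, 4), degree = 2)  # (1*3 + 2*4)^2 = 121
```

The degree parameter is what the tuning below searches over: higher degrees allow higher-order feature interactions at the risk of overfitting.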
set.seed(123)
svm_model <- svm(train_x_encoded, train_y_encoded, kernel = "poly", degree = 2)
# Predict on the test data
svm_predictions <- predict(svm_model, test_x_encoded)
# Compute the confusion matrix
conf_matrix <- confusionMatrix(svm_predictions, test_y_encoded)
# Extract F1 score and accuracy from the confusion matrix
f1_score <- conf_matrix$byClass["F1"]
accuracy <- conf_matrix$overall["Accuracy"]
# Print the results
print(paste("F1 Score:", f1_score))
## [1] "F1 Score: 0.974576271186441"
## [1] "Accuracy: 0.952755905511811"
Above we used a degree of 2; now we will explore more degree values and see how they affect performance. Steps:
1. Defining the hyperparameter range: for the SVM with a polynomial (poly) kernel, a range of degree values from 2 to 6 is defined, chosen to explore the impact of different polynomial degrees on the model's ability to capture complex, non-linear relationships in the data.
2. Setting up cross-validation: cross-validation is configured using the tune.control function, specifying a 10-fold method (cross = 10) to ensure a comprehensive evaluation of the model's performance across different segments of the dataset.
3. Performing the grid search: the tune function from the e1071 package is utilized to execute a grid search across the specified range of degree values, systematically assessing each degree's influence on the model's effectiveness.
4. Identifying the optimal hyperparameter: after the grid search, the model yielding the best cross-validation performance is identified, and the optimal polynomial degree (best_degree) is determined as the most effective setting for the SVM with the poly kernel.
5. Training and evaluating the final model: the SVM model is retrained with the optimal degree and assessed on the test data, extracting critical metrics such as the F1 score and accuracy from the confusion matrix. These metrics offer valuable insights into the model's classification accuracy and overall predictive power.
set.seed(123)
# Define a range of degree values to test
degree_range <- 2:6 # Adjust this range based on what you deem appropriate
# Set up the tune control to perform cross-validation
tune_control <- tune.control(sampling = "cross", cross = 10)
# Perform the cross-validation using the tune() function
svm_tune_result <- tune(
svm,
train.x = train_x_encoded,
train.y = train_y_encoded,
kernel = "poly",
ranges = list(degree = degree_range),
tunecontrol = tune_control
)
# Print the best model found
print(svm_tune_result$best.model)
##
## Call:
## best.tune(METHOD = svm, train.x = train_x_encoded, train.y = train_y_encoded,
## ranges = list(degree = degree_range), tunecontrol = tune_control,
## kernel = "poly")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 70
# Extract the best degree value
best_degree <- svm_tune_result$best.parameters$degree
# Print the best degree value
print(paste("Best degree value:", best_degree))
## [1] "Best degree value: 3"
# Train the SVM model using the best degree value
best_svm_model <- svm(train_x_encoded, train_y_encoded, kernel = "poly", degree = best_degree)
# Predict on the test data using the best model
best_svm_predictions <- predict(best_svm_model, test_x_encoded)
# Compute the confusion matrix for the best model
best_conf_matrix <- confusionMatrix(best_svm_predictions, test_y_encoded)
# Extract F1 score and accuracy from the confusion matrix
best_f1_score <- best_conf_matrix$byClass["F1"]
best_accuracy <- best_conf_matrix$overall["Accuracy"]
# Print the results for the best model
print(paste("Best F1 Score:", best_f1_score))
## [1] "Best F1 Score: 0.970464135021097"
## [1] "Best Accuracy: 0.94488188976378"
library(e1071)
library(pROC)
library(ggplot2)
set.seed(123)
# Train the SVM model using the best degree value with probability estimates
best_svm_model_poly <- svm(train_x_encoded, train_y_encoded, kernel = "poly", degree = best_degree, probability = TRUE)
# Predict on the test data using the best model with probability estimates
svm_prob_poly <- predict(best_svm_model_poly, test_x_encoded, probability = TRUE)
# Extract the probabilities for the positive class
svm_probabilities_poly <- attr(svm_prob_poly, "probabilities")[, "Class1"]
# Calculate the ROC data
roc_data_svm_poly <- roc(response = test_y_encoded, predictor = svm_probabilities_poly)
## Setting levels: control = Class0, case = Class1
## Setting direction: controls < cases
# Calculate the AUC
auc_svm_poly <- auc(roc_data_svm_poly)
# Print the AUC
print(paste("AUC for SVM with Polynomial Kernel:", auc_svm_poly))
## [1] "AUC for SVM with Polynomial Kernel: 0.959420289855072"
# Create a data frame for ggplot2
roc_df_svm_poly <- data.frame(
FPR = roc_data_svm_poly$specificities,
TPR = roc_data_svm_poly$sensitivities
)
# Plot ROC curve using ggplot2
roc_plot_svm_poly <- ggplot(roc_df_svm_poly, aes(x = 1 - FPR, y = TPR)) +
geom_line(color = "lightblue", size = 1) +
geom_abline(linetype = "dashed", color = "pink", size = 1) +
labs(title = "ROC Curve for SVM Model with Polynomial Kernel",
x = "False Positive Rate",
y = "True Positive Rate") +
theme_minimal()
# Print the ROC plot
print(roc_plot_svm_poly)
1- Linear Kernel:
- Best C value: 0.1
- F1 score: 0.970 (approx.)
- Accuracy: 94.49%
- AUC: 0.969
This kernel provides a high F1 score and accuracy, and the highest AUC of the three, indicating excellent performance in classifying instances correctly and a good balance between precision and recall.
2- Radial Basis Function (RBF) Kernel:
- Best gamma value: 0.0316
- F1 score: 0.975 (approx.)
- Accuracy: 95.28%
- AUC: 0.959
The RBF kernel attains the highest F1 score and accuracy of the three, but its AUC is lower than the linear kernel's, suggesting slightly less discriminative power across classification thresholds.
3- Polynomial Kernel:
- Best degree value: 3
- F1 score: 0.970 (approx.)
- Accuracy: 94.49%
- AUC: 0.959
The polynomial kernel matches the linear kernel in F1 score and accuracy, but its AUC is lower, which implies that, when considering the trade-off between true positive rate and false positive rate, it does not perform quite as well.
The linear kernel is deemed the best for several reasons:
- It has the highest AUC, indicating better overall performance across all possible classification thresholds.
- Its F1 score and accuracy are close to the best observed, demonstrating an excellent balance between precision and recall.
- The kernel's simplicity likely prevents overfitting, making it more generalizable to unseen data.
Principal Component Analysis (PCA) is a powerful statistical technique that is used for dimensionality reduction while preserving as much of the variability in the data as possible. It achieves this by transforming the original set of features into a new set of uncorrelated variables, known as principal components, which are ordered by the amount of variance they capture from the data.
The principal components are linear combinations of the original features and are orthogonal to each other, ensuring that they capture distinct information or variance. PCA is particularly useful in processing data with a large number of features, which may be correlated, by reducing the feature space to a smaller set of meaningful components without a significant loss of information. This simplification makes it easier to visualize and interpret the data, as well as to apply further analysis or machine learning algorithms efficiently.
The first principal component accounts for the largest possible variance, and each succeeding component has the highest variance possible under the constraint that it is orthogonal to the preceding components. The number of principal components retained is typically chosen based on the goal of preserving a large percentage of the data’s total variance, often using a threshold like 95%.
PCA is widely used in exploratory data analysis, predictive modeling, and in the preprocessing of data for other algorithms that can benefit from dimensionality reduction.
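The 95%-threshold rule mentioned above can be sketched in a few lines of base R. The toy matrix and variable names below are our own illustration, not the diabetes data:

```r
# Sketch: pick the number of principal components that together retain
# at least 95% of the total variance.
set.seed(1)
toy <- matrix(rnorm(100 * 5), ncol = 5)
toy[, 4] <- toy[, 1] + 0.01 * rnorm(100)  # make two columns highly correlated

pca <- prcomp(toy, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per PC
n_keep <- which(cumsum(var_explained) >= 0.95)[1]
n_keep  # fewer than 5 components suffice, since one column is nearly redundant
```

The same `cumsum` idea applied to the summary of our PCA below tells us how many components to retain for any chosen threshold.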
library(cluster)
# Remove the target variable 'diabetes' from the dataset
df_encoded <- df_encoded[, !names(df_encoded) %in% c('diabetes','gender','hypertension','heart_disease')]
# Convert all factors to numeric for PCA
numeric_data <- df_encoded %>%
mutate(across(where(is.factor), ~ as.numeric(as.character(.))))
str(df_encoded)
## 'data.frame': 637 obs. of 9 variables:
## $ age : num 53 57 21 19 35 44 58 33 34 78 ...
## $ bmi : num 27.3 27.3 27.3 25.9 27.3 ...
## $ HbA1c_level : num 4.5 8.2 3.5 6 6.1 4.5 6.6 7.5 6 5 ...
## $ blood_glucose_level : num 80 126 160 126 126 130 80 300 159 159 ...
## $ smoking_historycurrent : num 0 1 0 0 1 0 0 0 0 0 ...
## $ smoking_historyever : num 0 0 0 0 0 0 0 0 0 0 ...
## $ smoking_historyformer : num 0 0 0 0 0 0 1 0 0 1 ...
## $ smoking_historynever : num 1 0 0 1 0 1 0 1 1 0 ...
## $ smoking_historynot.current: num 0 0 1 0 0 0 0 0 0 0 ...
set.seed(123)
# Standardize the data
data_scaled <- scale(df_encoded[, sapply(df_encoded, is.numeric)])
# Apply PCA
pca_result <- prcomp(data_scaled, center = TRUE, scale. = TRUE)
#the summary gives the importance of components
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3505 1.1867 1.0810 1.0528 1.0410 0.92665 0.89383
## Proportion of Variance 0.2026 0.1565 0.1298 0.1231 0.1204 0.09541 0.08877
## Cumulative Proportion 0.2026 0.3591 0.4890 0.6121 0.7325 0.82792 0.91669
## PC8 PC9
## Standard deviation 0.86590 7.581e-16
## Proportion of Variance 0.08331 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00
set.seed(123)
## now make a fancy looking plot that shows the PCs and the variation:
library(ggplot2)
pca.data <- data.frame(Sample=rownames(pca_result$x),
X=pca_result$x[,1],
Y=pca_result$x[,2])
# Extract the proportion of variance explained by PC1 and PC2
pca.var.per <- pca_result$sdev^2 / sum(pca_result$sdev^2)
pca.var.per <- pca.var.per[1:2] * 100 # Convert to percentage
# View the loadings (contribution of each feature to each principal component)
loadings <- pca_result$rotation
# Inspect the loadings
print(loadings)
## PC1 PC2 PC3 PC4
## age -0.3905023 -0.28922862 -0.160181534 0.01844079
## bmi -0.3505858 -0.27502240 0.095424210 0.08026827
## HbA1c_level -0.2390004 -0.31822832 0.355097571 0.25052560
## blood_glucose_level -0.2789475 -0.31457149 0.327544512 0.29622804
## smoking_historycurrent -0.1381520 0.59848709 0.100443150 0.53141206
## smoking_historyever -0.1193668 0.13003037 0.298695081 0.02038238
## smoking_historyformer -0.4164747 -0.08641117 -0.673725412 -0.08852382
## smoking_historynever 0.5820278 -0.49759974 0.008324383 0.10719913
## smoking_historynot.current -0.2100541 0.10157712 0.422579696 -0.73522817
## PC5 PC6 PC7 PC8
## age 0.09663946 -0.07356444 0.58410679 0.61800072
## bmi 0.07518802 0.73138830 0.20822541 -0.44940362
## HbA1c_level 0.04127350 -0.64211874 0.16933040 -0.45832239
## blood_glucose_level 0.06614560 0.10085725 -0.67699975 0.39410727
## smoking_historycurrent 0.33446785 0.04282809 0.12236731 0.04011749
## smoking_historyever -0.87208871 0.06273930 0.12018015 0.09249273
## smoking_historyformer -0.10917806 -0.15701859 -0.29982994 -0.19381889
## smoking_historynever 0.08759704 0.07284358 0.08820338 0.04545765
## smoking_historynot.current 0.29483604 -0.03793782 -0.03694770 0.03057888
## PC9
## age 1.149469e-16
## bmi 1.352249e-16
## HbA1c_level 4.433513e-18
## blood_glucose_level 3.069045e-16
## smoking_historycurrent -4.471615e-01
## smoking_historyever -3.028801e-01
## smoking_historyformer -4.396602e-01
## smoking_historynever -6.158340e-01
## smoking_historynot.current -3.684529e-01
From this output, we can infer that:
PC1 is the most important component, followed by PC2, PC3, and so on. The first eight components together capture all the variance in the data; PC9 has an essentially zero standard deviation (about \(10^{-16}\)) and can be disregarded, which reflects the linear dependence among the one-hot encoded smoking-history columns.
Since we're interested in a 2D visualization, we will focus on the first two principal components, which, according to the summary above, account for approximately 36% of the variance (20.3% by PC1 and 15.7% by PC2). This will project our high-dimensional data into a 2D space.
set.seed(123)
ggplot(data=pca.data, aes(x=X, y=Y, label=Sample, colour=data$diabetes)) +
geom_point() +
scale_color_manual(values=c("pink", "lightblue")) + # Adjust colors as needed
xlab(paste("PC1 - ", pca.var.per[1], "%", sep="")) +
ylab(paste("PC2 - ", pca.var.per[2], "%", sep="")) +
theme_bw() +
ggtitle("My PCA Graph")
The graph shows a clear separation along PC1, which accounts for 20.3% of the variance in the dataset. This suggests that PC1 is capturing a significant aspect of the data's variability.
PC2 accounts for 15.7% of the variance, which is less than PC1 but still a substantial amount. There is also some separation along this component, although it is less pronounced than along PC1.
The points are color-coded based on the ‘diabetes’ variable, with one category represented in pink and the other in blue. From the plot, it seems that there is some degree of clustering based on the ‘diabetes’ status, but there is overlap between the two categories.
The clusters formed by the pink points (representing the ‘0’ category of diabetes status) appear to be more spread out than the blue points (representing the ‘1’ category). This could indicate that the ‘0’ category has greater variability in the dataset with respect to the principal components, or it could reflect different sample sizes between the groups.
There’s no clear boundary between the two categories, indicating that while the first two principal components capture some differences related to diabetes status, they are not completely distinct. This suggests that diabetes status is influenced by multiple factors, and not all of them can be fully captured by the first two principal components.
The overlap between the two categories implies that a higher-dimensional analysis or a more complex model might be necessary to better understand the differences between the two groups.
In order to visualize the results better, we will look at a 3D visualization using the first 3 components, capturing approximately 49% of the variance.
K-means clustering is a popular unsupervised learning algorithm that is used to partition a dataset into K distinct, non-overlapping subgroups (or clusters) based on similarity. It’s considered unsupervised because the grouping is done without reference to any external labels or ground truth.
The process of K-means clustering involves the following steps: 1. Initialization: K initial “centroids” are chosen at random from the dataset. 2. Assignment: Each data point is assigned to the nearest centroid, and groups are formed. 3. Update: The centroids are recalculated as the center of the newly formed groups. 4. Iterate: Steps 2 and 3 are repeated until the centroids no longer move significantly, indicating that the groups are as consistent as possible.
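The four steps above can be sketched directly in base R (toy data and variable names are our own; in practice the built-in kmeans() does all of this for us):

```r
# Sketch of k-means: initialize centroids, then repeat assignment + update.
set.seed(123)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # group around (0, 0)
             matrix(rnorm(20, mean = 5), ncol = 2))   # group around (5, 5)
centers <- pts[c(1, 11), ]                            # 1. initial centroids

for (iter in 1:10) {                                  # 4. iterate
  # 2. assignment: distance of every point to every centroid, pick the nearest
  d <- as.matrix(dist(rbind(centers, pts)))[-(1:2), 1:2]
  cl <- max.col(-d)
  # 3. update: centroids become the mean of their assigned points
  centers <- apply(pts, 2, function(col) tapply(col, cl, mean))
}
table(cl)  # two groups, typically recovering the two generating clusters
```

Here the initial centroids are picked deterministically for simplicity; kmeans() instead draws random starts (and nstart repeats them to avoid poor local optima).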
K-means is widely used in market segmentation, document clustering, image segmentation, and other areas where data needs to be grouped efficiently. Its simplicity and speed are advantageous when dealing with large datasets.
k2 <- kmeans(df_encoded, centers = 2, nstart = 25)
k3 <- kmeans(df_encoded, centers = 3, nstart = 25)
k4 <- kmeans(df_encoded, centers = 4, nstart = 25)
k5 <- kmeans(df_encoded, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df_encoded) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df_encoded) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df_encoded) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df_encoded) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
In the graphs above, we ran k-means clustering with different values of k (2 to 5) in order to visualize the differences and, hopefully, find the best k.
It is normal to have some overlap in k-means cluster plots, especially when visualizing high-dimensional data that has been reduced to two dimensions for the purpose of visualization. This dimensional reduction can sometimes cause clusters that are actually separate in high-dimensional space to appear overlapped in two-dimensional space.
However, we will use two methods in order to find the best k for cluster separation.
The Elbow Method graph above shows the within groups sum of squares (WSS) for different numbers of clusters. The best number of clusters k is typically determined by looking for the “elbow” in the graph, which is the point where the WSS begins to decrease at a slower rate. This point indicates that adding more clusters does not significantly improve the fit of the model.
In the graph above, the WSS sharply drops from 1 to 2 and continues to decrease as the number of clusters increases. However, the rate of decrease lessens after k=3. While the WSS continues to decrease past k=3, the reductions are more marginal and indicate diminishing returns on the improvement of the cluster fit.
Therefore, according to this Elbow Method graph, k=3 appears to be the most appropriate number of clusters to choose for this dataset, as it is the point after which the decrease in WSS begins to level off, signifying the balance between the number of clusters and the compactness of the clustering.
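The Elbow Method described above can be reproduced on toy data in a few lines (the data and names are our own; kmeans()$tot.withinss gives the WSS):

```r
# Sketch of the Elbow Method: total within-cluster sum of squares (WSS)
# against k; the "elbow" marks where extra clusters stop paying off.
set.seed(123)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2))  # three true clusters

wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters k", ylab = "WSS")
# With three well-separated groups, the curve bends sharply at k = 3.
```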
The silhouette method provides insight into the separation distance between the resulting clusters. A higher silhouette width indicates better cluster definition.
In the silhouette analysis graph:
The silhouette score for \(k=2\) is the highest, suggesting that the data points are well matched to their own clusters. However, this doesn’t necessarily mean it’s the best choice.
While the silhouette score peaks at \(k=2\), the score for \(k=3\) is also relatively high, which could be indicative of a good cluster structure with meaningful separation.
The silhouette score for \(k=3\) is above the average threshold, suggesting that the clusters are distinct and data points are closer to their own cluster center than to others.
As \(k\) increases beyond 3, the silhouette scores tend to decrease and plateau, indicating that additional clusters might not provide more distinct groupings and may start to segment the data unnecessarily.
Considering these factors, and based on domain knowledge or other contextual information that suggests \(k=3\) is optimal, we would conclude that three clusters provide a meaningful and interpretable separation of the data. This balances the statistical evidence from the silhouette scores with practical considerations for the dataset’s application.
Based on all of the above, we concluded that k=3 is most suitable for clustering our data.
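The silhouette analysis above can be sketched with cluster::silhouette on toy data of our own (the cluster package is loaded earlier in this report):

```r
# Sketch: average silhouette width per k; the k with the largest average
# width gives the best-defined clusters.
library(cluster)
set.seed(123)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2))  # three true clusters

d <- dist(toy)
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
round(avg_sil, 3)  # on this toy data the maximum is expected at k = 3
```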
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. It is commonly used in data analysis to organize data into tree-like structures (dendrograms) that reveal relationships and hierarchies among different data points.
How it Works:
Initialization: Each data point starts as its own cluster.
Similarity Measurement: Calculate distances between all pairs of clusters using a chosen metric (like Euclidean distance).
Cluster Merging: Repeatedly merge the closest pair of clusters.
Hierarchy Formation: Each merge forms part of a hierarchy represented as a tree structure (dendrogram).
Termination: The process continues until all data points are merged into a single cluster.
Number of Clusters: Decide the number of clusters by cutting the dendrogram at a desired level.
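The steps above, with average linkage and a cut at k = 3, can be sketched in base R on toy data of our own:

```r
# Sketch: hierarchical clustering with average linkage, then cutting the
# dendrogram into three clusters.
set.seed(123)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 4), ncol = 2),
             matrix(rnorm(20, mean = 8), ncol = 2))  # three true groups

hc <- hclust(dist(toy), method = "average")  # average linkage
plot(hc, main = "Dendrogram (average linkage)")
clusters <- cutree(hc, k = 3)                # cut the tree at k = 3
table(clusters)  # three groups, typically matching the generating clusters
```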
We have chosen to employ the average linkage method in our hierarchical
clustering approach. This decision was based on the average linkage
method’s ability to balance the extremities presented by single and
complete linkage methods. In average linkage, the distance between two
clusters is defined as the average distance between all pairs of points
in the respective clusters. This approach offers a more comprehensive
and robust understanding of cluster distances, as it considers all
inter-cluster data point relationships, thus reducing the impact of
outliers and noise in the data.
Furthermore, we opted to cut the dendrogram at k=3, mirroring our previous decision in the k-means clustering method. This decision is guided by the insights gleaned from our k-means analysis, which indicated that three clusters provide a meaningful and interpretable segmentation of our data. By maintaining consistency in the number of clusters between the two methods, we aim to facilitate a more coherent comparison and validation of our clustering results, thus reinforcing the reliability and accuracy of our overall analysis.
In conclusion:
- We first started by introducing the topic at hand, diabetes mellitus, and the different features provided in our dataset that might help us predict whether a person has the disease or not.
Next, we delved deeper into the dataset. We started by selecting only a subset of the records for computational efficiency. Furthermore, we used graphs to visualize our data, as they provide an accessible way to see and understand trends, outliers, and patterns. Based on this analysis, we concluded that the F1 score would be better than accuracy as a metric to assess model performance, due to the significant class imbalance observed in the dataset.
Furthermore, we moved onto data preprocessing. We started by removing the observations where no information was available for smoking history in order to avoid the possible negative repercussions that can arise from including data points with missing information in the dataset. Next, we converted gender from categorical into numerical binary values, as well as one-hot encoding smoking history. Finally, we produced a heat map that helped us get a feel for the possible correlations between the features in the dataset and the response.
We then split the data into training and test sets, and went on to selecting the relevant features. The latter was done using the Random Forest algorithm. After obtaining the results and examining the importance scores of each of the features, we decided to omit gender, heart disease, and hypertension, as they were the least important.
After running the Random Forest excluding and including the previously mentioned features, it was confirmed that they indeed did not have important predictive power, as both the accuracy and the F1 score varied little with or without them.
Still conducting tree-based methods, we moved onto bagging, where we obtained an F1 score of 0.8 and an accuracy of around 96%.
Next, we built pruned and unpruned decision trees, with respective F1 scores of 98% and 96%, which are both good; the pruned tree still performed slightly better.
To recap the tree-based methods:
The ROC curves of the methods indicated best performance in Random Forest and worst in the Pruned Decision Tree, out of the three tree-based methods that we tried.
While the ROC curves showed better performance in Bagging than for the pruned Decision Tree, the F1 scores demonstrated the opposite. Still, all in all, the best performing model was the Random Forest.
Moving on to support vector machines, let us recap the findings concerning each kernel:
The Linear Kernel exhibited a high F1 score of around 0.970 and an accuracy of 94.5%, indicating that the model is very reliable. Moreover, its AUC of 0.969, the highest of the three kernels, indicated very good discriminative performance.
The Radial Kernel (with a best gamma value of 0.0316) showed the strongest F1 score (approximately 0.975) and accuracy (95.3%). Its AUC of 0.959, while slightly below the linear kernel's, reinforces the model's robustness in making correct predictions across both classes.
The Polynomial Kernel, with a best degree value of 3, achieved an F1 score of around 0.970, an accuracy of 94.5%, and an AUC of 0.959, which is still considered good performance.
Comparing the Support Vector Machines, the linear kernel achieves the highest AUC while the radial kernel achieves the highest F1 score and accuracy; the strong performance of the linear kernel can indicate that our data is close to linearly separable.
After developing our models, we conducted Principal Component Analysis (PCA). We found that there is no clear boundary between the two categories (diabetic(1) and non diabetic(0)). The lack of an evident boundary between the two groups suggests that although the first two principal components capture some variations linked to diabetes status, they are not entirely different from one another. This shows that there are more factors influencing diabetes status than just the first two major components, and that these factors are not entirely accounted for. Given the overlap between the two categories, a more sophisticated model or a higher-dimensional analysis may be required in order to fully comprehend the distinctions between the two groups.
The final approach we will be conducting is the Clustering approach.
Starting with K-Means Clustering. After conducting a few methods that aim to find the optimal K, we concluded that a K=3 was the most suitable for our data.
Moving on to Hierarchical Clustering, we were able to build a dendrogram, employing the average linkage method. Staying consistent with the results of K-means clustering, we have opted to cut the dendrogram at k=3, obtaining a clear distinction between the three classes.
All things considered: machine learning-based models for diabetes identification are of high importance, as they have the potential to revolutionize diabetes care by enabling early detection, personalized interventions, and improved health outcomes. This could significantly reduce the burden of diabetes on individuals, healthcare systems, and society as a whole.